Polarity and lexical complexity to study social topics in music genres

Goals

This work aims to conduct a sentiment analysis across music genres. There are many ways to reach this goal, but it is important that the parameters and analysers we use fit our data. Sentiment analysis is a very large field, and we believe we used the libraries and functions most appropriate for the chosen dataset. This work can be divided into 4 parts:

  1. Data imports: structures, sorting and wrangling
  2. Classifiers: Choosing the methods of analysis and extracting features
  3. Visualization: How to meaningfully represent the data
  4. Finalization: Organization of the outputs
Dataset credits:

musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://labrosa.ee.columbia.edu/millionsong/musixmatch


In [32]:
#Importing libraries
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import re
import nltk
import scipy
import sklearn
import sklearn.preprocessing
import gensim as gs
import pylab as pl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Image
from nltk.corpus import stopwords
import os
import pickle

# internal imports
import helpers as HL


# Constants: PS! put in your own paths to the files
GLOVE_FOLDER = 'glove.twitter.27B'
GS_FOLDER = os.path.abspath("doesnt_matter" + "/../../../../" + "Machine_Learning/CD-433-Project-2/gensim_data_folder/") #PS: this path only matches my folder structure
GS_25DIM = GS_FOLDER + "/gensim_glove_vectors_25dim.txt"
GS_50DIM = GS_FOLDER + "/gensim_glove_vectors_50dim.txt"
GS_100DIM = GS_FOLDER + "/gensim_glove_vectors_100dim.txt"
GS_200DIM = GS_FOLDER + "/gensim_glove_vectors_200dim.txt"


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

1. Data imports: structures, sorting and wrangling

  • Import the song titles and artists from musiXmatch: 779,052 songs
  • Import the years from the Million Song Dataset additional files: 498,466 songs
  • Import the genres from TagTraum: 255,015 songs
  • Import the lyrics and the bags of words: 91,625 songs

1.1. Importing the songs

For now we only use titles and artist names, so we can handle this part with the musiXmatch data alone. We download the data and put it into a dataframe with the musiXmatch ID (MXM_Tid) and the Million Song Dataset track ID (Tid). Because later data may be keyed by either identifier, we keep both IDs; we are fully aware that having two IDs adds no information, it only ensures that other datasets will be easier to merge.
For now, we get 779,052 songs' artists and titles.


In [2]:
#Importing the text file in a DataFrame, skipping malformed lines (incomplete data)
matches = pd.read_table('Data/mxm_779k_matches.txt', error_bad_lines=False)

#Renaming the column to be clearer
matches.columns = ['Raw']

#Splitting the raw data on the separator once
split_data = matches['Raw'].str.split('<SEP>', expand=True)

#Extracting the Tid, artist name, title and MXM_Tid
matches['Tid'] = split_data[0]
matches['Artist_Name'] = split_data[1]
matches['Title'] = split_data[2]
matches['MXM_Tid'] = split_data[3]

#Dropping the header rows we do not need
matches = matches.drop(matches.index[:17])

#Dropping the column with raw data
matches = matches.drop('Raw', axis=1)

#set index Track ID
matches.set_index('Tid',inplace=True)

#Displaying results
display(matches.shape)
display(matches.head())


b'Skipping line 60821: expected 1 fields, saw 2\nSkipping line 126702: expected 1 fields, saw 2\n'
b'Skipping line 580629: expected 1 fields, saw 2\nSkipping line 632526: expected 1 fields, saw 2\n'
(779052, 3)
Artist_Name Title MXM_Tid
Tid
TRMMMKD128F425225D Karkkiautomaatti Tanssi vaan 4418550
TRMMMRX128F93187D9 Hudson Mohawke No One Could Ever 8898149
TRMMMCH128F425532C Yerba Brava Si Vos Querés 9239868
TRMMMXN128F42936A5 David Montgomery Symphony No. 1 G minor "Sinfonie Serieuse"/All... 5346741
TRMMMBB12903CB7D21 Kris Kross 2 Da Beat Ch'yall 2511405
Remarks:
  • There are two distinct identifiers for the same data. Because later data may be keyed by either identifier, we keep both IDs, fully aware that the second ID adds no information; it only makes merging other datasets easier.
  • This only contains the artist and title; we need further information such as the genre and the bag of words for each song.

1.2. Extracting the Year of the songs

We download the text file from the Million Song Dataset website, where it is provided as an additional feature of the dataset.
We merge the year data with the artists and song titles in the same dataframe.


In [3]:
#Loading the publication year data, skipping malformed lines in order to avoid errors
years = pd.read_table('Data/tracks_per_year.txt', error_bad_lines=False)
#Renaming the column to be clearer
years.columns = ['Raw']

#Getting the publication year
years['year'] = years['Raw'].str.split('<SEP>', expand=True)[0]

#Getting the Tid
years['Tid'] = years['Raw'].str.split('<SEP>', expand=True)[1]

#Dropping the raw data
years = years.drop('Raw', axis=1)

#set index Track ID
years.set_index('Tid',inplace=True)

#Appending the years to the original DataFrame
matches = pd.merge(matches, years, left_index=True, right_index=True)


b'Skipping line 487582: expected 1 fields, saw 2\nSkipping line 487590: expected 1 fields, saw 2\n'

In [4]:
#display the results
print(matches.shape)
display(matches.head())


(498466, 4)
Artist_Name Title MXM_Tid year
Tid
TRMMMKD128F425225D Karkkiautomaatti Tanssi vaan 4418550 1995
TRMMMRX128F93187D9 Hudson Mohawke No One Could Ever 8898149 2006
TRMMMCH128F425532C Yerba Brava Si Vos Querés 9239868 2003
TRMMMBB12903CB7D21 Kris Kross 2 Da Beat Ch'yall 2511405 1993
TRMMMNS128F93548E1 3 Gars Su'l Sofa L'antarctique 7503609 2007

Remarks:

We delete the rows without year information, which is why the dataframe contains fewer songs: pd.merge performs an inner join by default, so only tracks present in both tables are kept. In order to be as complete and accurate as possible, we consider only full matches.

1.3 Importing genres

We will now append a genre to each track.
We download the data from the TagTraum dataset and merge it with our previous dataframe.


In [5]:
#Creating a DataFrame to store the genres:
GenreFrame = pd.read_table('Data/msd-topMAGD-genreAssignment.txt', names=['Tid', 'genre'])

#set index Track ID
GenreFrame.set_index('Tid',inplace=True)

#Merge the new data with the previous dataframe
matches = pd.merge(GenreFrame, matches, left_index=True, right_index=True)

In [6]:
#Displaying results
print(matches.shape)
display(matches.head())


(255015, 5)
genre Artist_Name Title MXM_Tid year
Tid
TRAAAAK128F9318786 Pop_Rock Adelitas Way Scream 8692587 2009
TRAAAAV128F421A322 Pop_Rock Western Addiction A Poor Recipe For Civic Cohesion 4623710 2005
TRAAABD128F429CF47 Pop_Rock The Box Tops Soul Deep 6477168 1969
TRAAAEF128F4273421 Pop_Rock Adam Ant Something Girls 3759847 1982
TRAAAEM128F93347B9 Electronic Son Kite Game & Watch 2626706 2004
Comment:

The dataframe once again contains fewer songs. We proceed this way for the same reason as mentioned in the previous part.

1.4. Importing Location

We download the file with the location of every artist from the additional files.


In [7]:
#Creating a DataFrame to store the location:
location = pd.read_csv('Data/artist_location.txt', sep="<SEP>",header=None,names=['ArtistID','Latitude','Longitude','Artist_Name','City'])
#Keep only the useful columns
location.drop(['ArtistID','City'],inplace=True,axis=1)

#Merge on artist name, keeping Tid as a column during the merge
matches.reset_index(inplace=True)
matches = pd.merge(location, matches, on='Artist_Name')
matches.set_index('Tid',inplace = True)


/Users/havardbjornoy/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  

In [8]:
#Displaying results
display(matches.head())
print(matches.shape)


Latitude Longitude Artist_Name genre Title MXM_Tid year
Tid
TRBAWHU128EF3563C5 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock Murder In The Graveyard 7883403 1982
TREPTTY128EF3563C1 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock Penny Penny 2546441 1982
TRJSKEX128EF3563C7 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock London Rocker 2546440 1982
TRKCPUR128F92CF37D 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock Jack The Ripper 4439345 1982
TRKYESP128EF3563C0 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock Monster Rock 2546437 1982
(103401, 7)

1.5. Bags of words, extracting them from the train dataset

We downloaded the train datafile, which covers 30% of the whole dataset. It contains a list of the 5,000 most frequently used words across the songs. We then build two dataframes:

  • One with the ID of every song and its lyrics. We merge this with our previous dataframe.

    The lyrics are presented as follows: [(id of word):(occurrences in song)], e.g. [2:24][5:47]...

  • Another one with the 5,000 top words of the songs (the bag of words)

We work with only 30% of the whole dataset because the MusixMatch dataset is the only data that is freely available.
The rest of the data is not free; see https://developer.musixmatch.com/plans to verify.


In [9]:
#import file
lyrics = pd.read_table('Data/mxm_dataset_train.txt', error_bad_lines=False)

#change name of the column
lyrics.columns = ['Raw_Training']

# keep the bag-of-words line (the word list) to use later
words_train = lyrics.iloc[16]

#drop the header rows
lyrics=lyrics[17:].copy()

# get the TrackID and the word frequencies and put them in separate columns
def sortdata(x):
    splitted = x['Raw_Training'].split(',')
    x['Tid']=splitted[0]
    #splitted[1] is the MXM_Tid, which we already have
    x['words_freq']=splitted[2:]
    return x

#Apply the function to every row
lyrics = lyrics.apply(sortdata,axis=1)
lyrics = lyrics[['Tid','words_freq']]

In [10]:
#set index Track ID
lyrics.set_index('Tid',inplace=True)

#Appending the lyrics to the original DataFrame
matches = pd.merge(matches, lyrics, left_index=True, right_index=True)

In [11]:
#Displaying the results
print(matches.shape)
display(matches.head())


(38513, 8)
Latitude Longitude Artist_Name genre Title MXM_Tid year words_freq
Tid
TRAAAAV128F421A322 37.77916 -122.42005 Western Addiction Pop_Rock A Poor Recipe For Civic Cohesion 4623710 2005 [1:6, 2:4, 3:2, 4:2, 5:5, 6:3, 7:1, 8:1, 11:1,...
TRAAABD128F429CF47 35.14968 -90.04892 The Box Tops Pop_Rock Soul Deep 6477168 1969 [1:10, 3:17, 4:8, 5:2, 6:2, 7:1, 8:3, 9:2, 10:...
TRAAAEF128F4273421 35.83073 -85.97874 Adam Ant Pop_Rock Something Girls 3759847 1982 [1:5, 2:4, 3:3, 4:2, 5:1, 6:11, 9:4, 12:9, 13:...
TRAAAHJ128F931194C 39.74001 -104.99226 Devotchka Pop_Rock The Last Beat Of My Heart (b-side) 5133845 2004 [1:4, 2:11, 3:2, 4:7, 5:3, 6:5, 8:1, 9:3, 10:6...
TRAABIG128F9356C56 40.71455 -74.00712 Poe Pop_Rock Walk the Walk 678806 2000 [1:28, 2:77, 3:31, 4:41, 5:5, 6:13, 8:17, 9:5,...
Comments on the size:

Since we do not have access to the entire dataset, our analysis is limited to the 30% that is freely available on MusixMatch.

1.6. From generic bags of words to lyrics

We create a function that takes the list of word IDs and their occurrences in one song, i.e. [(id of word):(occurrences in the song)] as in [2:24][5:47]..., and outputs all the corresponding words in a list.

For example: [1:2,2:5,3:3] gives us --> [i,i,the,the,the,the,the,you,you,you]


In [12]:
#get the data
bag_of_words = words_train
# strip the leading '%' and split to create the list of 5000 words
bag_of_words = bag_of_words.str.replace('%','')
bag_of_words = bag_of_words.str.split(',')

display(bag_of_words.head())


Raw_Training    [i, the, you, to, and, a, me, it, not, in, my,...
Name: 16, dtype: object

In [13]:
#Defining a function
def create_text(words_freq):
    #accumulate all the words in a single string
    list_words=''
    #iterate over every "id:count" pair
    for compteur in words_freq:
        
        word = bag_of_words[0][int(compteur.split(':')[0])-1]
        times = int(compteur.split(':')[1])
        
        #Separating every word with a space to be able to work on it with libraries during part 2
        for i in range(times):
            list_words += ' ' + word + ' '
    return list_words

In [14]:
#Testing the function
print(create_text(lyrics.iloc[0]['words_freq']))


 i  i  i  i  i  i  the  the  the  the  you  you  to  to  and  and  and  and  and  a  a  a  me  it  my  is  is  of  of  of  your  that  are  are  we  we  am  am  will  will  for  for  for  for  be  have  have  so  this  like  like  de  up  was  was  if  got  would  been  these  these  seem  someon  understand  pass  river  met  piec  damn  worth  flesh  grace  poor  poor  somehow  ignor  passion  tide  season  seed  resist  order  order  piti  fashion  grant  captur  captur  ici  soil  patienc  social  social  highest  highest  slice  leaf  lifeless  arrang  wilder  shark  devast  element 
Comments on part one:

As is noticeable at each step, we lose data every time we merge datasets. We chose this approach because we only want to deal with complete information in order to stay coherent. We want to compare parameters between items, and we believe the analysis is less relevant if we consider a larger dataset containing incomplete data.

We now have 38,513 songs, but for each one we have all the features that we want to use. We will analyse our data with different parameters, which is why it is important that each song provides every item. Later in the analysis we may use the data from 1.4. (providing 103,401 songs) in order to get a broader overview.

2. Classifiers: Choosing the methods of analysis and extracting features

In order to analyse songs, we use sentiment analysis on the lyrics. We chose two key features: polarity and lexical complexity. Because we only have bags of words, parameters such as rhymes and song structure cannot be captured, even though they should be taken into consideration when speaking of the full complexity of lyrics.

2.1. Word polarity

Vader package

VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis package that provides a polarity score for a given word or sentence. It is known to be a very powerful tool, especially because it was trained on tweets, meaning that it covers most modern vocabulary. This is especially relevant for our project because we deal with modern music, so the words used are as modern as the ones VADER was tuned on. The fact that the sentiment analyser takes its roots in the same kind of vocabulary makes the analysis more relevant.

Polarity is expressed between -1 (negative polarity) and 1 (positive polarity).


In [15]:
import nltk.sentiment.sentiment_analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


/Users/havardbjornoy/anaconda3/lib/python3.6/site-packages/nltk/twitter/__init__.py:20: UserWarning: The twython library has not been installed. Some functionality from the twitter package will not be available.
  warnings.warn("The twython library has not been installed. "

In [16]:
#Defining the analyser
analyser = SentimentIntensityAnalyzer()
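
Before applying it at scale, here is a minimal usage sketch of the VADER API we rely on: polarity_scores() returns a dict with 'neg', 'neu', 'pos' and a normalized 'compound' score. The comments indicate the expected tendency, not precomputed outputs.


In [ ]:
#Clearly positive words should push the compound score towards +1
print(analyser.polarity_scores('love happy wonderful')['compound'])
#Clearly negative words should push it towards -1
print(analyser.polarity_scores('hate terrible pain')['compound'])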

2.2. Lyrics' complexity

Because we want to know what type of audience a specific type of music targets, we need to analyse the complexity of the lyrics. We are aware that dividing an audience into social profiles is far beyond the scope of our analysis; we do not have enough sociological knowledge to categorize an audience in a precise way. This is the reason why we use broad indicators. We want to know how complex a set of words is, and the only social assumption we make is that complexity is correlated with the age and the educational level of the audience.

We use the occurrence of each word in the whole dataset.

Extracting the vocabulary

Importing the most used words and their count inside the dataset in order to start a text processing analysis.

Extracting additional features

From the dataset, some metadata was given. The total word count is 55,163,335.

Because of the long-tail effect of language, we proceed with the first 50,000 words of the list. This keeps computing time manageable when iterating on the full_word_list.

From the vocabulary we remove stopwords. They appear too often at every level of language to be relevant for this analysis.

We then compute each word's percentage of occurrence, which will help us when dealing with the lyrics' complexity, and turn it into a complexity weight: frequently used words get a low weight, rarely used words a high weight.


In [17]:
Word_count_total = 55163335

#Importing the data, putting it in a DataFrame
full_word_list = pd.read_table('Data/full_word_list.txt')
#Renaming the columns
full_word_list.columns = ['Word']
#Extracting the word count
full_word_list['Count'] = pd.to_numeric(full_word_list['Word'].str.split('<SEP>', expand=True)[1])
#Extracting the words themselves
full_word_list['Word'] = full_word_list['Word'].str.split('<SEP>', expand=True)[0]
#Dropping the header rows we will not use
full_word_list = full_word_list.drop(full_word_list.index[:6])

#Extracting the first 50,000 values, because the rest is not stemmed and not necessarily in English
full_word_list = full_word_list.head(50000)


#Removing English stop words 
for word in full_word_list['Word']:
    if word in stopwords.words('english'):
        full_word_list = full_word_list[full_word_list.Word != word]
        
#Computing the percentage of occurrence:
full_word_list['Occurence_percentage'] = (full_word_list['Count']/ Word_count_total)*100
#Computing the weight of each word
full_word_list['Weight']= 1/full_word_list['Occurence_percentage']

display(full_word_list.shape)
display(full_word_list.head())


(49872, 4)
Word Count Occurence_percentage Weight
31 love 298043.0 0.540292 1.850852
34 know 273137.0 0.495142 2.019621
39 like 227624.0 0.412636 2.423441
44 get 192961.0 0.349799 2.858782
46 go 182812.0 0.331401 3.017490
Removing non-English words

Because they are much less common in the dataset, words that are not in English would be ranked with a very high complexity. Besides introducing a bias in the lexical complexity analysis, they would also cause trouble when treating polarity, because the VADER library only analyses English words. We will use the NLTK library to remove each non-English word from the bags of words.

We first need to download the "wordnet" NLTK package:


In [18]:
import nltk
#Using the NLTK downloader to get wordnet (uncomment on first run)
#nltk.download('wordnet')
from nltk.corpus import wordnet as wn

In [19]:
for j in full_word_list.index: 
    if not wn.synsets(full_word_list.Word[j]): #words with no WordNet synsets are treated as non-English
        full_word_list.drop(j, inplace=True)

In [20]:
full_word_list = full_word_list.sort_values('Weight', ascending=False)
display(full_word_list.head())
pickle.dump(full_word_list, open("full_word_list.pkl", "wb"))


Word Count Occurence_percentage Weight
49995 leipzig 15.0 0.000027 36775.556667
49219 chide 15.0 0.000027 36775.556667
49125 bookmark 15.0 0.000027 36775.556667
49129 bottleneck 15.0 0.000027 36775.556667
49134 bracken 15.0 0.000027 36775.556667

In [21]:
# function to get the complexity of one song by averaging the weights of all its words
def complexity_Song(lyrics):
    #variable to store the sum of the weights of every word in the song
    sum_weight= 0
    #split the lyrics to get an array of words and not just one big string
    lyric = lyrics.split(' ')
    
    #filtering empty values
    lyric = list(filter(None, lyric))
    
    #Removing every English stopword from the given lyric
    lyric = [word for word in lyric if word not in stopwords.words('english')]
    
    for x in lyric:
        #Making sure the word exists in the vocabulary
        if len(full_word_list.loc[full_word_list['Word'] == x]['Weight'].values) != 0 :
            sum_weight += full_word_list.loc[full_word_list['Word'] == x]['Weight'].values[0]

    return float(sum_weight/len(lyric))
Comment:

This implementation is inspired by the TF-IDF algorithm. If a word occurs rarely in the dataset, it is less common in the language, meaning that its lexical complexity is higher.

English stopwords are very common in every sentence; they are used so routinely that they add nothing relevant to the analysis. This is the reason why we take them out. Our complexity analysis must focus on words that do not appear regularly.
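
A side note on efficiency: complexity_Song scans full_word_list once per word. An equivalent variant with the same weighting precomputes a plain dict for constant-time lookups (a sketch, not the code used for the results below):


In [ ]:
#Hypothetical faster variant: build a word -> weight mapping once
weight_lookup = dict(zip(full_word_list['Word'], full_word_list['Weight']))

def complexity_song_fast(lyrics):
    stop = set(stopwords.words('english'))
    words = [w for w in lyrics.split(' ') if w and w not in stop]
    if not words:
        return 0.0
    #Words outside the vocabulary contribute a weight of 0, as in complexity_Song
    return sum(weight_lookup.get(w, 0) for w in words) / len(words)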

Analysis

We need to go from word frequencies to bags of words. Once this is done using our create_text function, we apply the polarity analyser.


In [22]:
#Resetting index
matches.reset_index(inplace=True)

#Initiating an empty column, in order to be able to iterate on it
matches['Bags_of_words'] = ''
#Building the textual data for every song in the DataFrame
for i in matches.index:
    matches.at[i, 'Bags_of_words'] = create_text(matches.at[i, 'words_freq'])

#Because we now have all the initial data in our DataFrame, we store it as a pickle object
matches.to_pickle('full_table.pkl')

Now that we have the bags of words in the DataFrame, we can conduct the analysis. Let us first work with the polarity:


In [23]:
#Loading the pickled DataFrame back
matches = pd.read_pickle('full_table.pkl')

#Applying the polarity analysis for the bags of words
for i in matches.index:
    matches.at[i, 'Polarity_score'] = analyser.polarity_scores(matches.at[i, 'Bags_of_words'])['compound']

In [24]:
display(matches.head(4))


Tid Latitude Longitude Artist_Name genre Title MXM_Tid year words_freq Bags_of_words Polarity_score
0 TRAAAAV128F421A322 37.77916 -122.42005 Western Addiction Pop_Rock A Poor Recipe For Civic Cohesion 4623710 2005 [1:6, 2:4, 3:2, 4:2, 5:5, 6:3, 7:1, 8:1, 11:1,... i i i i i i the the the the you yo... 0.7748
1 TRAAABD128F429CF47 35.14968 -90.04892 The Box Tops Pop_Rock Soul Deep 6477168 1969 [1:10, 3:17, 4:8, 5:2, 6:2, 7:1, 8:3, 9:2, 10:... i i i i i i i i i i you you you ... 0.9686
2 TRAAAEF128F4273421 35.83073 -85.97874 Adam Ant Pop_Rock Something Girls 3759847 1982 [1:5, 2:4, 3:3, 4:2, 5:1, 6:11, 9:4, 12:9, 13:... i i i i i the the the the you you ... 0.8689
3 TRAAAHJ128F931194C 39.74001 -104.99226 Devotchka Pop_Rock The Last Beat Of My Heart (b-side) 5133845 2004 [1:4, 2:11, 3:2, 4:7, 5:3, 6:5, 8:1, 9:3, 10:6... i i i i the the the the the the the... 0.8720

Sorting outputs into valuable categories

Because we want a precise data structure, we must aggregate our outputs in the most efficient way for later visualization.

We need metadata per topic, per genre and per artist.


In [25]:
sns.set(color_codes=True, style="darkgrid")

def polarity_graph_generator(Data_in, categorization):
    
    Data_in[categorization] = Data_in[categorization].astype('category')

    for cat in Data_in[categorization].cat.categories:
        #Selecting the songs of the current category
        Division = Data_in[(Data_in[categorization] == cat)]
        
        #Sorting values by polarity to create a graph
        Division = Division.sort_values('Polarity_score', ascending=False)
        #Resetting the index
        Division = Division.reset_index()
        #plotting the results
        sns_plot = sns.tsplot(Division['Polarity_score'], color='m').set_title('Polarity in {}'.format(cat))
        x = range(len(Division['Polarity_score']))
        y = Division['Polarity_score']
        ax = sns_plot.axes
        ax.fill_between(x, 0, y)
        fig = sns_plot.get_figure()
        #Storing the graph (MUST CREATE THE FOLDER FIRST!)
        fig.savefig("Polarity_plots/{} polarity.png" .format(cat))
        
        #Clearing the figure
        fig.clf()
    return

2.3. Topic classification

Having the data divided into genres is important for our analysis; however, we are still missing one key dimension to make our work relevant for social good: the topic addressed in the songs. We must be able to know which subject a song deals with; we can then aggregate the data per genre and understand how a particular genre handles a specific topic. For this part we are still considering two options:


In [27]:
#import global vectors from Stanford's pretrained GloVe set, trained on tweets; choose the desired dimension: 25, 50, 100 or 200
global_vectors = HL.load_gensim_global_vectors(GS_200DIM)
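
The loading details are hidden in helpers.py; for reference, a minimal sketch of such a loader, assuming the GloVe text files were converted to word2vec format beforehand (e.g. with gensim.scripts.glove2word2vec), could look like this:


In [ ]:
from gensim.models import KeyedVectors

def load_gensim_global_vectors_sketch(path):
    #Load a word2vec-format text file as gensim KeyedVectors (hypothetical helper)
    return KeyedVectors.load_word2vec_format(path, binary=False)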

Defining topics so we can calculate a word's similarity to a topic

This is where the most bias from us, the creators, might enter. We chose the defining words using a thesaurus and the vector space's outputs.


In [28]:
#Defining the topics
racism = ['racism', 'nigger', 'negro', 'race', 'racist', 'bigot', 'bigotry', 'apartheid', 'discrimination', 'segregation', 'unfairness', 'partiality', 'sectarianism', 'colored']
women = ['women','girl', 'daughter', 'mother', 'she', 'wife', 'aunt', 'gentlewoman', 'girlfriend', 'grandmother', 'matron', 'niece', 'spouse', 'miss', 'genre']
money = ['money','bill', 'capital', 'cash', 'check', 'fund', 'pay', 'payment', 'property', 'salary', 'wage', 'wealth', 'banknote', 'bankroll', 'bread', 'bucks', 'chips', 'coin', 'coinage', 'dough', 'finances', 'funds', 'gold', 'gravy', 'greenback', 'loot', 'pesos', 'ressources', 'riches', 'roll', 'silver', 'specie', 'treasure', 'wad', 'wherewithal']
revolution = ['revolution','change', 'overthrow', 'demand', 'freedom', 'war', 'movement', 'brotherhood', 'reform', 'radical', 'leadership']
politics =  ['politics', 'president', 'governor', 'senator', 'campaigning','government','civics','electioneering','legislature','policy','political']
religion = ['religion', 'religious', 'religions', 'atheism', 'secular', 'islam', 'islamic', 'atheist', 'bible', 'christian', 'jew', 'muslim', 'theology', 'god', 'church', 'buddhism', 'hinduism','belief', 'pray', 'prayer', 'worship']
art = ['art', 'movie', 'singing', 'painting', 'ballet', 'theatre']
health = ['health', 'nutrition', 'medical', 'wellness', 'healthy', 'care', 'safety', 'fitness', 'obesity', 'cancer', 'sickness', 'disease'] 

# make lists so one can iterate through the topics
name_of_topics = ['racism', 'women', 'money', 'revolution', 'politics', 'religion', 'art', 'health']
words_defining_topics = [racism, women, money, revolution, politics, religion, art, health]

Calculate every word's relation to the different topics


In [38]:
vocab_topics = HL.vocabulary_calculate_topics(words_defining_topics, name_of_topics, global_vectors)


racism
women
money
revolution
politics
religion
art
health
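
HL.vocabulary_calculate_topics is an internal helper; conceptually, it can be approximated as below. This is a sketch under the assumption that a word is flagged for a topic when its average cosine similarity to the topic's defining words exceeds a threshold; the function name and the 0.5 threshold are illustrative, not the notebook's actual parameters.


In [ ]:
def vocabulary_calculate_topics_sketch(vocab, words_defining_topics, name_of_topics,
                                       vectors, threshold=0.5):
    #Start from the vocabulary DataFrame and add one 0/1 column per topic
    out = vocab.copy()
    for name, defining in zip(name_of_topics, words_defining_topics):
        print(name)
        def flag(word):
            if word not in vectors.vocab:
                return 0
            #Average similarity to the defining words the model knows
            sims = [vectors.similarity(word, d) for d in defining if d in vectors.vocab]
            return int(bool(sims) and np.mean(sims) > threshold)
        out['topic_' + name] = out['Word'].apply(flag)
    return out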

In [39]:
visual_vocab = vocab_topics.copy(deep=True)
print(visual_vocab.shape)

display(visual_vocab.sort_values('topic_revolution',ascending=False).head(15))


(11624, 12)
Word Count Occurence_percentage Weight topic_racism topic_women topic_money topic_revolution topic_politics topic_religion topic_art topic_health
23095 pakistan 49.0 0.000089 11257.823469 0 0 0 1 0 0 0 0
19726 congress 62.0 0.000112 8897.312097 0 0 0 1 1 0 0 0
967 vision 5269.0 0.009552 104.694126 0 0 0 1 0 0 0 0
1895 religion 2113.0 0.003830 261.066422 1 0 0 1 0 1 0 0
10198 reform 175.0 0.000317 3152.190571 0 0 0 1 1 0 0 0
11307 president 149.0 0.000270 3702.237248 0 0 0 1 1 0 0 0
1320 action 3447.0 0.006249 160.032884 0 0 0 1 0 0 0 0
8595 patriot 226.0 0.000410 2440.855531 0 0 0 1 0 0 0 0
35595 feudal 25.0 0.000045 22065.334000 1 0 0 1 0 0 0 0
19050 sanction 66.0 0.000120 8358.081061 0 0 0 1 0 0 0 0
2922 threat 1127.0 0.002043 489.470586 0 0 0 1 0 0 0 0
24717 implement 44.0 0.000080 12537.121591 0 0 0 1 1 0 0 0
3244 driven 965.0 0.001749 571.640777 0 0 0 1 0 0 0 0
31762 personnel 30.0 0.000054 18387.778333 0 0 0 1 0 0 0 0
41651 radical 20.0 0.000036 27581.667500 0 0 0 1 0 0 0 0

Extrapolate from (word <-> topic)-relation to (song <-> topic)-relation


In [40]:
# dataframe with songs
matches = pd.read_pickle('full_table.pkl')

In [41]:
display(matches.head(4))


Tid Latitude Longitude Artist_Name genre Title MXM_Tid year words_freq Bags_of_words
0 TRAAAAV128F421A322 37.77916 -122.42005 Western Addiction Pop_Rock A Poor Recipe For Civic Cohesion 4623710 2005 [1:6, 2:4, 3:2, 4:2, 5:5, 6:3, 7:1, 8:1, 11:1,... i i i i i i the the the the you yo...
1 TRAAABD128F429CF47 35.14968 -90.04892 The Box Tops Pop_Rock Soul Deep 6477168 1969 [1:10, 3:17, 4:8, 5:2, 6:2, 7:1, 8:3, 9:2, 10:... i i i i i i i i i i you you you ...
2 TRAAAEF128F4273421 35.83073 -85.97874 Adam Ant Pop_Rock Something Girls 3759847 1982 [1:5, 2:4, 3:3, 4:2, 5:1, 6:11, 9:4, 12:9, 13:... i i i i i the the the the you you ...
3 TRAAAHJ128F931194C 39.74001 -104.99226 Devotchka Pop_Rock The Last Beat Of My Heart (b-side) 5133845 2004 [1:4, 2:11, 3:2, 4:7, 5:3, 6:5, 8:1, 9:3, 10:6... i i i i the the the the the the the...

Find out if the songs are about any of the topics and assign values in the dataframe


In [43]:
matches, occz, totz = HL.score_songs(matches, vocab_topics)
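
The scoring logic also lives in helpers.py; here is a sketch of the idea, with hypothetical names and normalization: each song's topic score is the share of its word occurrences flagged for that topic, while the two counters track how many occurrences fall outside the vocabulary.


In [ ]:
def score_songs_sketch(songs, vocab_topics):
    topic_cols = [c for c in vocab_topics.columns if c.startswith('topic')]
    #word -> topic flags lookup
    flags = vocab_topics.set_index('Word')[topic_cols]
    for col in topic_cols:
        songs[col] = 0.0
    not_in_vocab, total = 0, 0
    for i in songs.index:
        words = songs.at[i, 'Bags_of_words'].split()
        known = [w for w in words if w in flags.index]
        not_in_vocab += len(words) - len(known)
        total += len(words)
        if words:
            for col in topic_cols:
                #share of the song's word occurrences flagged for this topic
                songs.at[i, col] = flags.loc[known, col].sum() / len(words)
    return songs, not_in_vocab, total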

In [45]:
print("Percentage of the songs that is not in the vocabulary: %.2f%%" % ((occz/totz)*100))


Percentage of the song words that are not in the vocabulary: 65.15%

In [46]:
#Fetch the column names of the topics
column_names = [col for col in vocab_topics.columns if col.startswith('topic')]
#Binarize each topic: 1 if the song's score is above the column average, else 0
for col in column_names:
    col_average = np.mean(matches[col])
    print(col_average)
    matches[col] = matches[col].apply(lambda x: 1 if x>col_average else 0)


0.0004855860355439085
0.008619094213475184
0.0036768540716145665
0.004160959722150218
0.00024649280481682734
0.010195056976289656
0.0027011381547442557
0.0025004970332949835

In [47]:
# this is how the dataframe looks after manipulation
visual_matches = matches.copy(deep=True)
display(visual_matches.sort_values('topic_racism',ascending=False).head(40))


Tid Latitude Longitude Artist_Name genre Title MXM_Tid year words_freq Bags_of_words topic_racism topic_women topic_money topic_revolution topic_politics topic_religion topic_art topic_health
26702 TRRVRKZ128F422F360 42.31256 -71.08868 Bury Your Dead Pop_Rock Let Down Your Hair (Album Version) 5229968 2006 [1:11, 2:3, 3:8, 4:2, 5:7, 7:4, 8:4, 9:1, 10:2... i i i i i i i i i i i the the th... 1 1 0 1 0 1 0 0
30923 TRUTCXT128F934B6B3 29.95369 -90.07771 Goatwhore Pop_Rock Diabolical Submergence Of Rebirth 5424744 2006 [2:18, 4:10, 5:1, 6:2, 8:1, 10:12, 11:2, 12:1,... the the the the the the the the the ... 1 0 0 1 0 0 0 1
2341 TRBOTOT128F932EA4C 51.50632 -0.12714 Current 93 Pop_Rock Alone 2155342 1987 [1:9, 2:18, 3:1, 4:7, 5:11, 6:9, 7:7, 8:6, 9:3... i i i i i i i i i the the the the... 1 0 1 1 1 1 0 0
2339 TRBOTKX128F42BB7E9 50.97768 11.02307 Yvonne Catterfeld Pop_Rock Die Zeit ist reif 5585229 2006 [10:8, 20:5, 21:2, 22:1, 54:9, 97:19, 122:2, 1... in in in in in in in in am am am a... 1 0 0 1 0 0 0 0
7003 TRESRJI128E079340A 51.50632 -0.12714 Ms. Dynamite RnB It Takes More 9939995 2002 [1:9, 2:13, 3:18, 4:29, 5:11, 6:17, 7:19, 8:10... i i i i i i i i i the the the the... 1 1 1 1 0 0 0 0
18834 TRMRVHI128F146C70D 50.77813 6.08849 LaFee Pop_Rock Virus 5442621 2006 [10:4, 22:5, 28:4, 54:2, 97:6, 102:1, 122:5, 1... in in in in all all all all all so ... 1 0 0 0 0 0 0 0
20663 TRNXIQU128F92DC556 50.11204 8.68342 Rapsoul Rap Sag ja 6453491 2007 [10:3, 21:3, 28:3, 54:4, 97:6, 102:8, 122:4, 1... in in in will will will so so so was... 1 0 0 0 0 0 0 0
26041 TRRKFXU128F932E72D 51.50632 -0.12714 Echobelly Pop_Rock Paradise 1258395 1997 [1:5, 2:8, 3:4, 4:3, 5:9, 6:3, 8:11, 9:15, 10:... i i i i i the the the the the the ... 1 0 0 1 0 1 0 0
35017 TRXOWVZ128F42AED4F 53.55334 9.99245 Revolverheld Pop_Rock Generation Rock 4941099 2005 [20:2, 22:3, 28:1, 54:4, 73:1, 97:9, 102:5, 12... am am all all all so was was was was... 1 0 0 0 0 0 0 0
4615 TRDCMFH128F429626D 50.77813 6.08849 LaFee Pop_Rock Jetzt Erst Recht 7368749 2007 [1:1, 20:1, 21:4, 28:5, 54:4, 97:2, 122:5, 189... i am will will will will so so so so... 1 1 0 0 0 0 0 0
30893 TRUSNSW128F42B7BB8 40.71455 -74.00712 The Bravery Pop_Rock Fearless 5616161 2004 [1:17, 2:10, 3:12, 4:5, 5:10, 6:4, 7:9, 8:4, 9... i i i i i i i i i i i i i i i ... 1 0 0 1 0 0 0 0
24794 TRQQCDY128F92C57C5 36.99462 -86.44558 Nappy Roots Rap Nappy Roots Day (Explicit Album Version) 1590753 2003 [1:20, 2:34, 3:5, 4:20, 5:14, 6:17, 7:1, 8:12,... i i i i i i i i i i i i i i i ... 1 0 1 1 0 0 0 1
8743 TRFVPIE128F92D8813 51.16418 10.45415 Funny Van Dannen Pop_Rock Mode 1982738 1997 [10:5, 22:1, 28:1, 54:1, 97:2, 102:8, 158:2, 2... in in in in in all so was die die u... 1 0 0 1 0 0 0 0
8744 TRFVQAN128F1476ABC 50.77813 6.08849 LaFee Pop_Rock Sterben Für Dich 5329925 2006 [10:2, 22:1, 28:1, 140:1, 181:11, 189:11, 218:... in in all so hand du du du du du du... 1 0 0 0 0 0 0 0
5684 TRDUNRO12903CF2530 43.04999 -76.14739 Brand New Sin Pop_Rock Dead Man Walking 3987590 2005 [2:1, 3:2, 5:1, 6:4, 9:3, 12:2, 16:2, 18:2, 24... the you you and a a a a not not not... 1 0 0 1 0 1 0 0
30888 TRUSMSY128F42758F0 51.50632 -0.12714 The Waterboys Pop_Rock Further Up Further In (2008 Digital Remaster) 744614 1990 [1:16, 2:22, 3:3, 4:7, 5:6, 6:8, 7:2, 8:3, 10:... i i i i i i i i i i i i i i i ... 1 1 1 1 0 1 0 0
4606 TRDCIED128F421CD32 60.20624 24.65620 Children Of Bodom Pop_Rock Roadkill Morning 6926817 2008 [1:12, 2:9, 3:10, 4:11, 5:3, 6:7, 7:8, 8:3, 9:... i i i i i i i i i i i i the the ... 1 1 1 1 0 1 0 0
12745 TRIPJXI128F934C109 35.96049 -83.92091 Whitechapel Pop_Rock Father Of Lies 7318239 2008 [1:10, 2:10, 3:4, 5:1, 7:4, 11:6, 13:10, 14:10... i i i i i i i i i i the the the ... 1 0 0 1 0 1 0 1
23782 TRPYYNO128F92F4337 50.11204 8.68342 Rapsoul Rap Du siehst es doch genauso 6453489 2007 [10:5, 22:10, 28:16, 46:13, 54:12, 73:13, 76:3... in in in in in all all all all all ... 1 0 0 0 0 0 0 0
37253 TRZCUIA128F424CB62 45.19398 5.73200 Miss Kittin Electronic Sunset Strip 6877932 2008 [2:9, 4:2, 5:2, 6:3, 8:1, 12:1, 13:3, 17:5, 19... the the the the the the the the the ... 1 1 0 0 0 1 0 0
8197 TRFMRFB128F42BB7F7 50.97768 11.02307 Yvonne Catterfeld Pop_Rock Ich lauf einfach los 5585237 2006 [21:2, 22:2, 28:1, 54:5, 97:4, 102:2, 122:1, 1... will will all all so was was was was ... 1 0 0 0 0 0 0 0
1278 TRAVHEP128F934B52A 42.31256 -71.08868 The Red Chord Pop_Rock Fixation On Plastics 5483558 2005 [1:2, 2:4, 3:1, 4:5, 5:5, 6:13, 7:2, 8:6, 9:1,... i i the the the the you to to to to... 1 0 0 0 0 0 0 0
11979 TRIBVBW128F92EA87D 53.41961 -8.24055 Bell X1 Pop_Rock The Ribs Of A Broken Umbrella (Album Version) 8043244 2009 [1:3, 2:11, 4:5, 5:8, 6:12, 8:5, 10:4, 12:1, 1... i i i the the the the the the the t... 1 0 0 0 0 0 0 0
3203 TRCEPWK128F42365A9 40.85251 -73.13585 Glassjaw Pop_Rock Pretty Lush (Album Version) 7636880 2000 [1:12, 2:8, 3:23, 4:11, 5:13, 6:6, 7:2, 8:4, 1... i i i i i i i i i i i i the the ... 1 1 0 1 0 1 0 0
4086 TRCTRQA128F92C1FAC 53.55334 9.99245 Revolverheld Pop_Rock Gegen die Zeit 6231841 2007 [10:8, 22:2, 54:17, 97:12, 102:9, 106:1, 122:1... in in in in in in in in all all was... 1 0 0 0 0 0 0 0
7796 TRFGIKW128F42AA0BB 33.79502 9.56154 Dany Brillant Pop_Rock Dans Les Rues De Rome 2100401 2001 [1:1, 6:2, 7:2, 17:6, 38:3, 42:9, 47:11, 117:4... i a a me me on on on on on on que ... 1 0 0 0 0 0 0 0
28205 TRSWAMW12903CE6AAB 42.28474 -83.38348 Insane Clown Posse Pop_Rock 17 Dead 1226405 1993 [1:39, 2:36, 3:20, 4:8, 5:17, 6:22, 7:5, 8:9, ... i i i i i i i i i i i i i i i ... 1 0 1 0 0 1 1 0
11069 TRHLPXG12903CFED67 34.05349 -118.24532 Threshold Pop_Rock Sanity's End (live) 5172893 1994 [1:6, 2:15, 3:5, 4:6, 5:10, 6:8, 7:2, 9:6, 10:... i i i i i i the the the the the th... 1 0 0 0 0 0 0 0
4626 TRDCSDL128F9355D56 40.65507 -73.94888 The Cardigans Pop_Rock Been It 1520922 1996 [1:24, 3:12, 4:5, 5:2, 6:1, 7:10, 8:1, 10:1, 1... i i i i i i i i i i i i i i i ... 1 0 0 1 0 1 0 0
5668 TRDUIDB128F9303D0C 33.62646 -80.94740 Pur Pop_Rock Herz Für Kinder. 2614723 1990 [10:2, 20:3, 21:1, 28:1, 54:3, 97:7, 122:5, 15... in in am am am will so was was was ... 1 0 0 1 0 0 0 0
625 TRAKMJW128F425F12C 57.65337 14.69725 Backyard Babies Pop_Rock Be Myself And I 2251787 2003 [1:31, 3:27, 4:5, 5:2, 6:7, 8:2, 9:9, 11:2, 12... i i i i i i i i i i i i i i i ... 1 0 0 1 0 1 0 0
37227 TRZCKKT128F9303FF0 52.51607 13.37698 Element Of Crime Pop_Rock Alle Vier Minuten 769450 2001 [10:8, 21:1, 22:2, 28:1, 97:10, 102:4, 106:5, ... in in in in in in in in will all al... 1 0 0 1 0 0 0 0
37228 TRZCKTV128F92D3830 51.16418 10.45415 Glashaus RnB Liebst Du mich? 1124618 2001 [10:2, 21:2, 22:4, 28:4, 54:4, 97:1, 122:1, 15... in in will will all all all all so s... 1 0 0 0 0 0 0 0
28264 TRSXBZE12903CED21E 40.71455 -74.00712 Riot Pop_Rock Storming The Gates Of Hell 2344709 1990 [1:3, 2:19, 4:2, 5:9, 6:2, 10:2, 12:2, 13:11, ... i i i the the the the the the the t... 1 0 1 1 0 1 0 1
3585 TRCKOSL128F4226F6D 35.96049 -83.92091 Whitechapel Pop_Rock Devirgination Studies 6346182 2007 [1:3, 2:2, 3:3, 4:7, 5:2, 9:2, 10:2, 11:5, 12:... i i i the the you you you to to to ... 1 1 0 1 0 1 0 0
36197 TRYJBUK128F92C1FC0 53.55334 9.99245 Revolverheld Pop_Rock Hallo Welt 7183860 2007 [10:2, 22:26, 28:2, 97:2, 123:1, 158:3, 181:1,... in in all all all all all all all al... 1 0 0 1 0 0 0 0
36198 TRYJCFO128F92E5269 52.82812 12.07305 Annett Louisan Pop_Rock Die sein 6540547 2007 [10:9, 28:3, 97:19, 122:2, 189:1, 218:6, 226:2... in in in in in in in in in so so s... 1 0 0 0 0 0 0 0
20035 TRNNBNM128F4277EB9 34.05349 -118.24532 L7 Pop_Rock Ms. 45 592696 1988 [2:6, 3:7, 5:3, 6:5, 8:1, 9:3, 12:9, 14:1, 16:... the the the the the the you you you ... 1 0 0 1 0 1 0 0
3175 TRCEETX128F1491B7E 53.38311 -1.46454 Bring Me The Horizon Pop_Rock [I Used To Make Out With] Medusa 5607564 2006 [1:8, 2:9, 3:13, 4:6, 5:3, 6:1, 7:8, 9:6, 10:4... i i i i i i i i the the the the t... 1 0 0 1 0 1 0 0
622 TRAKLWA128F4296E3B 36.16778 -86.77836 Laurent Voulzy International Qui Est In Qui Est Out 6204529 1979 [6:1, 7:1, 38:1, 42:1, 47:3, 77:1, 90:1, 112:1... a me que de la la la y en te tu tu... 1 0 0 0 0 0 0 0

In [ ]:
matches.to_excel("final_table.xls")
